The goal of this assignment is to apply BSCA, testing three different values for each of the clustering hyperparameters, $\alpha$ and $M$, to the 1D and 2D projections of the Stellar Classification Dataset produced by both the PCA and LDA techniques from the Dimensionality Reduction assignment.
Clustering is an unsupervised technique that aims to form groupings of data points without any class knowledge. It is not classification: the goal is to discover groupings in the data independent of their labels, and because the process is unsupervised it is common to end up with a different number of groupings than there are unique classes in the data. Which points belong to which cluster, and how many clusters form, is determined by a chosen distance measure between points and clusters together with a specified distance threshold and a maximum cluster count. High dimensionality can make clustering difficult, so it is sometimes necessary to apply dimensionality reduction before clustering; choosing the right number of clusters and the right distance threshold can also be tricky. To run the algorithm, the distance threshold ($\alpha$) and the maximum number of clusters ($M$) must be specified. The first point is initialized as the first cluster, and the algorithm then iterates through the remaining samples. For each sample, the minimum distance to the existing cluster(s) is computed: if that distance exceeds the threshold and the current cluster total is less than $M$, a new cluster is formed from that point; otherwise the point is added to the cluster it is closest to. This process continues, adding each sample to a cluster or creating a new one, until every sample has been measured and assigned.
User-defined parameters: $$\alpha = \text{distance threshold}$$ $$ M = \text{maximum cluster count} $$ Initialize the cluster total $t$ as 1 and the first sample, $\vec{X}_1$, as the first cluster, $C_1$: $$ t = 1;\quad \text{where}\; t = \text{cluster total} $$ $$\vec{X}_1 = C_1$$ For all remaining samples, $\vec{X}_2, \dots, \vec{X}_N$, find the minimum distance to the existing clusters, $d_{min}(\vec{X}_i, C_j)$: $$ d_{min}(\vec{X}_i, C_j); \quad i = 2, \dots, N $$ If the distance exceeds the threshold $\alpha$ and the cluster total $t$ is less than the maximum $M$, create a new cluster from the data point and update the cluster total: $$ \text{IF} \quad d_{min}(\vec{X}_i, C_j) > \alpha \quad \text{AND}\quad t < M $$ then $$ t = t + 1 $$ $$ C_t = \vec{X}_i $$ Otherwise (the distance is within the threshold $\alpha$, or $M$ clusters already exist), add the point to the cluster $C_j$ with the smallest distance: $$ \text{IF} \quad d_{min}(\vec{X}_i, C_j) \leq \alpha $$ then $$ C_j = C_j \cup \{\vec{X}_i\} $$
Continue finding minimum distances and assigning samples to the closest cluster (or creating new clusters from them) until all samples have been assigned to a cluster.
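The steps above can be sketched in plain NumPy on a toy 1D example (a minimal illustration with hypothetical names, not the notebook's `BSCA` implementation below):

```python
import numpy as np

def bsca_sketch(samples, alpha, m):
    """Assign each sample to the nearest cluster mean, or open a new
    cluster when the nearest mean is farther than alpha and fewer
    than m clusters exist."""
    clusters = [[samples[0]]]          # the first sample seeds C_1
    for x in samples[1:]:
        # distance from x to each current cluster mean
        dists = [abs(x - np.mean(c)) for c in clusters]
        j = int(np.argmin(dists))
        if dists[j] > alpha and len(clusters) < m:
            clusters.append([x])       # open a new cluster C_t
        else:
            clusters[j].append(x)      # join the closest cluster
    return clusters

clusters = bsca_sketch([0.0, 0.1, 5.0, 5.2, 0.2], alpha=1.0, m=3)
print([sorted(c) for c in clusters])  # [[0.0, 0.1, 0.2], [5.0, 5.2]]
```

Note that cluster means shift as points are added, so the assignment order can affect the result; this is inherent to the sequential scheme.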
# import libraries and data
import numpy as np
import pandas as pd
import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
from dims_reduction import PCA, LDA
# Load data
DATA = 'data/star_classification.csv'
stellar_df = pd.read_csv(DATA)
labels_df = pd.DataFrame(stellar_df['class'])
X_pca = stellar_df.drop(columns='class')
# 1D PCA
# create the one dimensional data and set the column to be the 1st principal component
pca_one = pd.DataFrame(PCA(X_pca, 1), columns=['PC1'])
# join the classes back to the observations
#pca_one_df = pd.concat([pca_one_df, labels_df], axis=1)
# 2D PCA
# create the two dimensional data and set the columns to the 1st and 2nd principal components
pca_two = pd.DataFrame(PCA(X_pca, 2), columns=['PC1', 'PC2'])
# join the classes back to the observations
#pca_two_df = pd.concat([pca_two_df, labels_df], axis=1)
print("1D PCA shape: {}\n2D PCA shape: {}".format(pca_one.shape, pca_two.shape))
1D PCA shape: (100000, 1) 2D PCA shape: (100000, 2)
# drop the class label and the id fields
columns = ['class','obj_ID', 'run_ID','rerun_ID', 'field_ID', 'spec_obj_ID', 'MJD', 'fiber_ID', 'cam_col', 'plate']
X_lda = stellar_df.drop(columns=columns)
labels = stellar_df['class']
#1D LDA
lda_one = LDA(data=X_lda, labels=labels, n=1)
#2D LDA
lda_two = LDA(data=X_lda, labels=labels, n=2)
print("1D LDA shape: {}\n2D LDA shape: {}".format(lda_one.shape, lda_two.shape))
1D LDA shape: (100000, 1) 2D LDA shape: (100000, 2)
# for calculating the distance from a point (row) to the passed cluster
def clust_distance(X, cluster):
    # 1D projection: absolute distance to the cluster mean
    if len(X) == 1:
        distance = float(abs(X[0] - np.mean(cluster)))
    # 2D projection: Euclidean distance to the cluster centroid
    elif len(X) == 2:
        cluster_means = np.mean(np.array(cluster), axis=0)
        distance = np.sqrt((cluster_means[0] - X[0])**2 + (cluster_means[1] - X[1])**2)
    else:
        # raise instead of returning an undefined variable for unsupported dimensionalities
        raise ValueError("could not find distance to cluster: expected 1 or 2 columns")
    return distance
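The two dimension-specific branches above could also be collapsed into a single Euclidean distance via `np.linalg.norm`, which works for any number of columns; a possible generalization (a sketch, not the notebook's function):

```python
import numpy as np

def clust_distance_general(x, cluster):
    """Euclidean distance from point x to the mean of a cluster,
    for any dimensionality (1D, 2D, ...)."""
    centroid = np.mean(np.asarray(cluster), axis=0)
    return float(np.linalg.norm(np.asarray(x) - centroid))

print(clust_distance_general([1.0], [[0.0], [2.0]]))     # 0.0 (cluster mean is 1.0)
print(clust_distance_general([3.0, 4.0], [[0.0, 0.0]]))  # 5.0
```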
def BSCA(data, m, alpha):
    '''
    Arguments:
        data  -> df of data to cluster, observations row-wise
        m     -> int, maximum number of clusters to form within the data
        alpha -> float, distance threshold;
                 determines how far away a point can be before it must create a new cluster
    Returns:
        cluster_df -> df with the value column(s) and a final column giving each row's cluster
    '''
    # Initialize the cluster total (t) as 1
    t = 1
    # initialize cluster dictionary that will hold the samples in each cluster
    clusters = {c: [] for c in range(m)}
    # initialize the first sample, X_1, as the first cluster, C_1
    clusters[0].append(data.iloc[0, :].to_numpy())
    print("initialize a new cluster: cluster 0")
    # For all samples (X_2 ... X_N) find the min distance to each cluster d_min(X_i, C_j)
    for row in data.iloc[1:, :].to_numpy():
        curr_clusters = t
        # distance from this sample to each existing cluster
        distances = [clust_distance(row, clusters[c]) for c in range(curr_clusters)]
        shortest_dist = min(distances)
        closest_cluster = distances.index(shortest_dist)
        # If the distance is larger than alpha and t < m, start a new cluster from the sample
        if t < m and shortest_dist > alpha:
            print('A new cluster has been made: cluster ' + str(curr_clusters))
            clusters[curr_clusters].append(row)
            t = t + 1
        # Else -> add the sample to the closest cluster
        else:
            clusters[closest_cluster].append(row)
    print("Clustering complete...")
    # create df from the cluster dictionary,
    # appending the cluster number the row belongs to at the end of the row
    cluster_df = pd.DataFrame()
    for c in clusters:
        for sample in clusters[c]:
            row = pd.DataFrame(np.append(sample, c)).T
            # df of size (examples, data[0].size + 1)
            cluster_df = pd.concat([cluster_df, row], ignore_index=True)
    return cluster_df
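Once `BSCA` returns, the cluster sizes can be sanity-checked with a `groupby` on the last column; a small self-contained sketch using a mock of the returned layout (value column(s) followed by a cluster-id column):

```python
import pandas as pd

# mock of the layout BSCA returns: value column(s) + cluster id in the last column
cluster_df = pd.DataFrame([[0.1, 0], [0.2, 0], [5.0, 1], [5.2, 1], [0.3, 0]])

# size of each cluster (the last column holds the cluster label)
sizes = cluster_df.groupby(cluster_df.columns[-1]).size()
print(sizes.to_dict())  # {0: 3, 1: 2}
```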
# sample 20% of data to save clustering computation time
# clustering all 100,000 points: ~30 min; clustering the 20,000-point subset: ~1 min
pca_one_subset = pca_one[::5]
pca_two_subset = pca_two[::5]
lda_one_subset = lda_one[::5]
lda_two_subset = lda_two[::5]
projections = [pca_one_subset, pca_two_subset, lda_one_subset, lda_two_subset]
m = [2, 3, 4]
alpha = [0.4, 0.6, 0.8]
k = 1
for x in range(len(projections)):
    for i in range(len(m)):
        for j in range(len(alpha)):
            print('\nClustering combo #{}'.format(k))
            df = BSCA(projections[x], m[i], alpha=alpha[j])
            df['m'] = m[i]
            df['alpha'] = alpha[j]
            if x == 0:
                path = 'PCA_1D'
            elif x == 1:
                path = 'PCA_2D'
            elif x == 2:
                path = 'LDA_1D'
            elif x == 3:
                path = 'LDA_2D'
            df.to_csv("data/clustering/{}/m{}_a{}.csv".format(path, i, j))
            k += 1
Console output from the 36 runs (abridged): each run prints "initialize a new cluster: cluster 0" followed by a message for each additional cluster created. Final cluster counts per combo: #1–3: 2; #4–9: 3; #10–12: 2; #13–15: 3; #16–18: 4; #19–21: 2; #22–27: 1; #28–30: 2; #31–33: 3; #34–36: 1.
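The triple-nested loop above could also be flattened with `itertools.product`, which reduces index bookkeeping over the hyperparameter grid; a minimal sketch using placeholder names for the four projections:

```python
from itertools import product

projections = ['PCA_1D', 'PCA_2D', 'LDA_1D', 'LDA_2D']  # placeholder projection names
m_values = [2, 3, 4]
alphas = [0.4, 0.6, 0.8]

# one flat loop over every (projection, m, alpha) combination
combos = list(product(projections, m_values, alphas))
print(len(combos))  # 36 combinations, matching the runs above
print(combos[0])    # ('PCA_1D', 2, 0.4)
```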
# merges dfs from a list into one df
def merge_dfs(dfs_list):
df = pd.DataFrame()
for i in range(len(dfs_list)):
df = pd.concat([df,dfs_list[i]])
return df
# loads csvs as dfs for plotting
def load_files(path, file_names, merge=False):
'''
loads the csv files for the given projection specified by the file path and file_names list
returns a list of dfs for that directory
    if merge == True then a single merged dataframe will be returned instead of a list of dfs
'''
df_list = []
for i in range(len(file_names)):
temp_df = pd.read_csv(path + file_names[i] + '.csv', index_col=0)
df_list.append(temp_df)
if merge:
return merge_dfs(df_list)
else:
return df_list
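As an aside, the `merge_dfs` loop is equivalent to a single `pd.concat` call over the whole list, which avoids repeated copying; a minimal sketch:

```python
import pandas as pd

dfs = [pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3]})]

# pd.concat accepts the whole list at once; ignore_index renumbers the rows
merged = pd.concat(dfs, ignore_index=True)
print(merged['a'].tolist())  # [1, 2, 3]
```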
def compare_plots(df, pc_num, title):
'''
for plotting all clustering combinations obtained from load_files and merge_dfs
df = merged df with columns m and alpha
    pc_num = num of principal components
title = desired title of fig
'''
#plot for 1D projections
if pc_num == 1:
fig = px.scatter(df, x='0', color='1', facet_row='m', facet_col='alpha',
labels={
'0': 'PC1',
'1': 'Cluster'})
for axis in fig.layout:
if type(fig.layout[axis]) == go.layout.YAxis:
fig.layout[axis].title.text = ''
# plot for 2D projections
elif pc_num == 2:
fig = px.scatter(df, x='0', y='1', color='2', facet_row='m', facet_col='alpha',
labels={
'0': 'PC1',
'1' : 'PC2',
'2': 'Cluster'})
fig.update_layout(title=title, title_x=0.5, title_y=.98, margin_l=30, margin_b=60)
return fig
# load dfs from directory and append to df_list for plotting
path = 'data/clustering/PCA_1D/'
file_names = ['m0_a0', 'm0_a1', 'm0_a2', 'm1_a0',
'm1_a1', 'm1_a2', 'm2_a0', 'm2_a1', 'm2_a2']
pca_1d_clusts = load_files(path=path, file_names=file_names, merge=True)
pca_1d_plot = compare_plots(pca_1d_clusts, pc_num=1, title='PCA-1D Clustering')
plotly.offline.plot(pca_1d_plot, filename='figures/clustering/pca_1d_plot.html')
pca_1d_plot.show()
path = 'data/clustering/LDA_1D/'
lda_1d_clusts = load_files(path=path, file_names=file_names, merge=True)
lda_1d_plot = compare_plots(lda_1d_clusts, pc_num=1, title='LDA-1D Clustering')
plotly.offline.plot(lda_1d_plot, filename='figures/clustering/lda_1d_plot.html')
lda_1d_plot.show()
path = 'data/clustering/PCA_2D/'
pca_2d_clusts = load_files(path=path, file_names=file_names, merge=True)
pca_2d_plot = compare_plots(pca_2d_clusts, pc_num=2, title='PCA-2D Clustering')
plotly.offline.plot(pca_2d_plot, filename='figures/clustering/pca_2d_plot.html')
pca_2d_plot.show()
path = 'data/clustering/LDA_2D/'
lda_2d_clusts = load_files(path=path, file_names=file_names, merge=True)
lda_2d_plot = compare_plots(lda_2d_clusts, pc_num=2, title='LDA-2D Clustering')
plotly.offline.plot(lda_2d_plot, filename='figures/clustering/lda_2d_plot.html')
lda_2d_plot.show()